Computer Assignment 4
In this project we work on a dataset containing information about properties and their prices. Since the dataset may be faulty, for example it may contain missing values, we first need to preprocess it. After preprocessing, we try different models and, after tuning them, use them to predict the prices of the houses in the test set that have not been priced.
Phase 0. In this phase we visualize the dataset.
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsRegressor
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelBinarizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import VotingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import make_scorer
from sklearn import preprocessing
import matplotlib.pyplot as plt
from sklearn import tree
import pandas as pd
import numpy as np
import math
import io
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train.head()
q1.
train.describe()
Using describe we get information about each numeric column: count (the number of non-missing values), mean (the average of the values), and other statistics such as std, min, max, and the 25th, 50th, and 75th percentiles. This function only reports results for the numeric columns.
train.info()
The info function reports the number of non-null values in each column, which we can also infer from describe. It helps us find the missing values.
q2.
num_of_rows = len(train.iloc[:,0])
num_of_rows
Here we count the number of NaNs in each column. Since the columns other than 'LotFrontage', 'GarageYrBlt', 'MasVnrType', 'MasVnrArea', and 'Electrical' use 'NA' as a legitimate category (meaning the feature is absent), we have to remove the NAs counted from those columns.
count_nan = train.isna().sum()
Here we show the NaN counts before this correction.
count_nan = count_nan.sort_values()
count_nan.tail(50)
Here we zero out those counts and show the result after the correction.
count_nan['BsmtQual']=0
count_nan['BsmtCond']=0
count_nan['BsmtFinType1']=0
count_nan['BsmtFinType2']=0
count_nan['BsmtExposure']=0
count_nan['GarageQual']=0
count_nan['GarageFinish']=0
count_nan['GarageType']=0
count_nan['GarageCond']=0
count_nan['FireplaceQu']=0
count_nan['Fence']=0
count_nan['Alley']=0
count_nan['MiscFeature']=0
count_nan['PoolQC']=0
count_nan.tail(50)
lost_data_each_feature_p = count_nan/num_of_rows
lost_data_each_feature_p.tail(30)
Here we define the function that will be used later on.
def percent_lost_data(data):
    num_of_row = len(data.iloc[:, 0])
    count_nans = data.isna().sum()
    count_nans = count_nans.sort_values()
    # Columns where 'NA' is a legitimate category, not a missing value.
    for col in ['BsmtQual', 'BsmtCond', 'BsmtFinType1', 'BsmtFinType2',
                'BsmtExposure', 'GarageQual', 'GarageFinish', 'GarageType',
                'GarageCond', 'FireplaceQu', 'Fence', 'Alley',
                'MiscFeature', 'PoolQC']:
        count_nans[col] = 0
    lost_data_each_feature_p_f = count_nans / num_of_row
    return lost_data_each_feature_p_f
percent_lost_data(train)
q3.
In this part, we first compute the correlation matrix and then draw a heatmap based on it.
def cor_show(df):
    cor = df.corr()
    plt.figure(figsize=(50, 40))
    plt.matshow(cor, cmap=plt.cm.Reds)
    plt.colorbar()
    plt.title('Correlation Matrix', fontsize=10)
    plt.show()
    # Correlation with output variable
    cor_target = abs(cor['SalePrice'])
    return cor_target
cor_target_train = cor_show(train)
relevant_features = cor_target_train[cor_target_train>0.5]
relevant_features
Having the correlations of the features, we can see which features correlate most with the output, the price of each house. We can pick 'OverallQual', 'GrLivArea', 'GarageCars', and 'GarageArea' as the four features most influential on the price.
q4.
train_log_result=train
train_log_result = train_log_result.assign(SalePrice=np.log2(train['SalePrice']))
train_log_result
cor_target_train = cor_show(train_log_result)
#Selecting highly correlated features
relevant_features = cor_target_train[cor_target_train>0.5]
relevant_features
After taking the log of the house prices, we can see that the four most correlated features are still 'OverallQual', 'GrLivArea', 'GarageCars', and 'GarageArea'. The correlation values have changed but the result has stayed the same. This is because the log transform changes all the prices together, and since we choose the features by comparing them with each other, the ranking is preserved.
q5. No, it is not enough, since we have only considered numerical features. There may be non-numerical features that are more correlated with the price.
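One way to check this claim is to encode a categorical column and include it in the correlation computation. Below is a minimal sketch on a toy DataFrame (the 'Quality' column and its values are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data: a categorical quality label that tracks the price.
df = pd.DataFrame({
    'Quality': ['Ex', 'Gd', 'TA', 'Ex', 'Gd', 'TA', 'Ex', 'TA'],
    'SalePrice': [300, 200, 100, 310, 210, 110, 290, 95],
})

# One-hot encode the categorical column so it becomes numeric,
# then correlate every resulting column with the price.
encoded = pd.get_dummies(df, columns=['Quality'], dtype=int)
cor = encoded.corr()['SalePrice'].abs().sort_values(ascending=False)
print(cor)
```

Here the dummy for the 'Ex' label correlates strongly with the price even though the raw column is non-numeric and would otherwise be left out of the correlation matrix.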
q6.
train.plot.scatter(x='OverallQual', y='SalePrice')
train.plot.hexbin(x='OverallQual', y='SalePrice', gridsize=20)
From these plots we can see that 'OverallQual' has a somewhat exponential relation to price, with the concentration mainly on the middle values 4, 5, and 6. When the overall material and finish quality of the house is high, the range of house prices is wider. Since this feature's value differs across price levels, it is clearly very influential on the price: knowing 'OverallQual', we can determine the price to some degree, so the two are heavily correlated.
train.plot.scatter(x='GrLivArea', y='SalePrice')
train.plot.hexbin(x='GrLivArea', y='SalePrice', gridsize=20)
We can see that for 'GrLivArea' most houses have around 500 to 2000 square feet of above-grade (ground) living area. The plots show that the larger the 'GrLivArea', the wider the range of prices. From this scattered, roughly line-like pattern we can see that this feature is very influential on the price.
train.plot.scatter(x='GarageCars', y='SalePrice')
train.plot.hexbin(x='GarageCars', y='SalePrice', gridsize=20)
For the size of the garage in car capacity, the data is spread fairly evenly, with the concentration mainly on 3. Houses with a larger car capacity also show more varied prices. We can see how this feature heavily influences the price.
train.plot.scatter(x='GarageArea', y='SalePrice')
train.plot.hexbin(x='GarageArea', y='SalePrice', gridsize=20)
We can see that the size of the garage in square feet is concentrated between 200 and 600, and that houses with the same garage area still have varying prices. This feature is clearly correlated with the price.
q7.
In my opinion, 'MSSubClass', 'OverallQual', and 'BsmtFinSF1' are the most influential factors on the price.
train.plot.scatter(x='MSSubClass', y='SalePrice')
train.plot.hexbin(x='MSSubClass', y='SalePrice', gridsize=20)
We can see that this is not a good feature, since houses with the same 'MSSubClass' can have very different prices.
train.plot.scatter(x='OverallQual', y='SalePrice')
train.plot.hexbin(x='OverallQual', y='SalePrice')
We can observe that this is a comparably good feature, since knowing 'OverallQual' narrows the price down to a particular range.
train.plot.scatter(x='BsmtFinSF1', y='SalePrice')
train.plot.hexbin(x='BsmtFinSF1', y='SalePrice', gridsize=20)
This feature is also not a good one, because knowing 'BsmtFinSF1' still leaves a very wide range for the price.
q8.
Phase 1. In this phase, we preprocess the data to remove lost or wrong values that might cause errors. This phase is the most important part of learning.
q1. If we have a large enough dataset, we can delete the whole row. As a second option, we can replace the missing value with the average of that feature; this method is not recommended, since in most cases it reduces the variability of the data. There are other methods as well. One is educated guessing: for example, if most values of a column are the same, we can infer that the missing value equals them. We can also use regression substitution, in which we predict the missing value from the other values, but we need enough data to form stable regression equations. The most complicated but most popular method is multiple imputation, in which we take advantage of the correlations between features.
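The simpler of these strategies can be sketched in pandas; the frame below is a hypothetical example, not the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing entries in a numeric and a nominal column.
df = pd.DataFrame({
    'LotFrontage': [60.0, np.nan, 80.0, 70.0],
    'Electrical': ['SBrkr', 'SBrkr', None, 'FuseA'],
})

# Strategy 1: delete every row that contains a missing value.
dropped = df.dropna()

# Strategy 2: replace missing numeric values with the column mean.
filled_mean = df['LotFrontage'].fillna(df['LotFrontage'].mean())

# Strategy 3 (for nominal columns): replace with the most frequent value.
filled_mode = df['Electrical'].fillna(df['Electrical'].mode()[0])
```

This is the same fillna pattern used on the real columns below; the row deletion and mean/mode replacement trade sample size against variability, as discussed above.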
q2. We have seen that five features, 'LotFrontage', 'GarageYrBlt', 'MasVnrType', 'MasVnrArea', and 'Electrical', have missing values. Considering the percentage of lost data from Phase 0 part 2, we see that 'GarageYrBlt', 'MasVnrType', 'MasVnrArea', and 'Electrical' have few missing values, so we can remove the faulty rows; for 'LotFrontage', which has many missing values, we replace them with the average of the column.
count_nan.sort_values().tail(50)
in_GarageYrBlt = list(np.where(train['GarageYrBlt'].isnull())[0])
in_MasVnrType = list(np.where(train['MasVnrType'].isnull())[0])
in_MasVnrArea = list(np.where(train['MasVnrArea'].isnull())[0])
in_Electrical = list(np.where(train['Electrical'].isnull())[0])
in_LotFrontage = list(np.where(train['LotFrontage'].isnull())[0])
print(len(in_GarageYrBlt), len(in_MasVnrType), len(in_MasVnrArea), len(in_Electrical))
We can see that these counts match the numbers of missing values we found earlier.
all_toberemoved_index_four_least = in_GarageYrBlt + in_MasVnrType + in_MasVnrArea + in_Electrical
all_toberemoved_index = in_GarageYrBlt + in_MasVnrType + in_MasVnrArea + in_Electrical + in_LotFrontage
Here we remove the duplicate indexes.
all_toberemoved_index_four_least = list(set(all_toberemoved_index_four_least))
all_toberemoved_index = list(set(all_toberemoved_index))
train_preprocess_four_least_removed=train.copy(deep=True)
train_all_lost_remove=train.copy(deep=True)
train_all_replaced=train.copy(deep=True)
train_preprocess_four_least_removed = train_preprocess_four_least_removed.drop(all_toberemoved_index_four_least)
train_preprocess_four_least_removed.tail(10)
We can see that the chosen rows have been removed. Now we need to replace the missing values in 'LotFrontage' with the average of the column.
train_preprocess_four_least_removed['LotFrontage'].tail(20)
train_preprocess_four_least_removed['LotFrontage'] = train_preprocess_four_least_removed['LotFrontage'].fillna(train_preprocess_four_least_removed['LotFrontage'].mean())
train_preprocess_four_least_removed.tail(10)
We can see that the NaNs have been replaced with the mean of the column. In order to compare results later on, we need to produce datasets with different types of preprocessing. Now we proceed to produce the other preprocessed datasets.
train_all_lost_remove = train_all_lost_remove.drop(all_toberemoved_index)
train_all_lost_remove.tail(10)
Here we replace all the missing numeric feature values with the mean of their column, and for the nominal features, 'MasVnrType' and 'Electrical', we replace them with their most frequent value.
print(train_all_replaced['MasVnrType'][529]) #for checking
print(train_all_replaced['Electrical'][1379])
to_replace_nan_string = train_all_replaced['MasVnrType'].value_counts().idxmax()
train_all_replaced['MasVnrType'] = train_all_replaced['MasVnrType'].fillna(to_replace_nan_string)
to_replace_nan_string = train_all_replaced['Electrical'].value_counts().idxmax()
train_all_replaced['Electrical'] = train_all_replaced['Electrical'].fillna(to_replace_nan_string)
print(train_all_replaced['MasVnrType'][529])
print(train_all_replaced['Electrical'][1379])
train_all_replaced['GarageYrBlt'].tail(10)
train_all_replaced['LotFrontage'] = train_all_replaced['LotFrontage'].fillna(train_all_replaced['LotFrontage'].mean())
train_all_replaced['GarageYrBlt'] = train_all_replaced['GarageYrBlt'].fillna(train_all_replaced['GarageYrBlt'].mean())
train_all_replaced['MasVnrArea'] = train_all_replaced['MasVnrArea'].fillna(train_all_replaced['MasVnrArea'].mean())
train_all_replaced['GarageYrBlt'].tail(10)
Now we have four different kinds of data, which will help us with the analysis later on.
q3. Through data normalization, we can get rid of duplicate data. It also helps clean up the data, which makes analysis and visualization easier, and lets us logically group related data. There are other benefits too: normalized data takes up less space, which can improve performance; it is easier to change or update; and data from different sources can be combined without dealing with their differences. Standardization helps us compare measurements that have different units. Given these points, and the fact that visualization matters in this project, we need to normalize our dataset. But before we do that, we need to handle the non-numeric features.
def normalize(data):
    # Select the numeric columns (everything that is not of object dtype).
    list_of_nominal = list(data.keys()[list(np.where(data.dtypes != object))])
    list_of_nominal.remove('SalePrice')
    for col in list_of_nominal:
        data[col] = (data[col] - data[col].min()) / (data[col].max() - data[col].min())
    return data
normalize(train).head()
q4. The answer depends on the kind of non-numeric feature. For binary features, we can use the replace() function to map them to one and zero. For ordinal features, we use integer encoding, converting labels to integer values with the pandas map() function. For nominal features, we use one-hot encoding; integer encoding does not work here and leads to poor model performance. With this in mind, and having read the dataset description, we decide to handle all categorical features with one-hot encoding. We will also try deleting all categorical data to see how each choice changes the result.
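These three encodings can be sketched as follows; the column names are borrowed from the dataset but the values are made up for illustration:

```python
import pandas as pd

# Hypothetical toy frame: one binary, one ordinal, and one nominal feature.
df = pd.DataFrame({
    'CentralAir': ['Y', 'N', 'Y'],
    'ExterQual': ['Gd', 'TA', 'Ex'],
    'RoofStyle': ['Gable', 'Hip', 'Gable'],
})

# Binary feature: replace the two labels with 1/0.
df['CentralAir'] = df['CentralAir'].replace({'Y': 1, 'N': 0})

# Ordinal feature: map the labels onto integers that preserve their order.
df['ExterQual'] = df['ExterQual'].map({'Po': 0, 'Fa': 1, 'TA': 2, 'Gd': 3, 'Ex': 4})

# Nominal feature: one-hot encode; drop the first dummy to avoid redundancy.
df = pd.get_dummies(df, columns=['RoofStyle'], drop_first=True, dtype=int)
```

The one-hot branch is the same get_dummies pattern used by the one_hot_encoding function below.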
def one_hot_encoding(data):
    to_be_encoded = data.keys()[list(np.where(data.dtypes == object))]
    for key in to_be_encoded:
        data = pd.get_dummies(data, prefix=[key], columns=[key], drop_first=True)
    return data
one_hot_encoding(train).keys()
As we can see, the categorical features have been replaced.
def remove_cat(data):
    return data.drop(columns=(data.keys()[list(np.where(data.dtypes == object))]))
remove_cat(train)
q5. No, we do not need to keep all the features, since some have little or no correlation with the price. Using part 3 of phase 0 we can find these features; for example, we can keep only the features with a correlation greater than 0.5.
def remove_uncorrelated(data, a):  # a is the correlation cutoff; it can be a number between 0 and 1, e.g. 0.5
    cor_target_train = cor_show(data)
    to_removed_features = cor_target_train[cor_target_train < a]
    return data.drop(columns=to_removed_features.keys())
q6. For the value of p, there are common train/test split percentages we can use, such as 80/20, 67/33, and 50/50. If our dataset is not very big, it is better to keep more data for training; with a big dataset this is less of a concern. We do need to split the dataset randomly, so that both the training and the testing data are representative of the original dataset. For this part, we can use sklearn.model_selection for splitting.
def splitting(data, test_size_r, random_state_r):
    train_to_be_split = data.copy(deep=True)
    train_x = train_to_be_split.drop(columns=['SalePrice'])
    # Drop the uninformative Id column if it is present.
    if 'Id' in train_x.keys():
        train_x = train_x.drop(columns=['Id'])
    train_y = train_to_be_split.pop('SalePrice')
    return train_test_split(train_x, train_y, test_size=test_size_r, random_state=random_state_r)
x_train, x_test, y_train, y_test =splitting(remove_cat(train_all_lost_remove),0.33,42)
x_train
y_train
We can see how the data is split, which will be useful for future analysis.
Phase 2. In this phase we build models and try to predict outcomes based on them. We also compare the accuracy of different models.
q1. In this part we build functions that will help calculate the accuracy of different models. Here we only operate on one kind of preprocessed data, in which the rows that had NaNs have been removed along with the categorical features.
x_test.head(1)
def knc_predict(x_train_f, x_test_f, y_train_f, num_neighbors):
    scaler = StandardScaler()
    scaler.fit(x_train_f)
    x_train_f = scaler.transform(x_train_f)
    x_test_f = scaler.transform(x_test_f)
    knc = KNeighborsClassifier(n_neighbors=num_neighbors)
    knc.fit(x_train_f, y_train_f)
    return knc.predict(x_test_f)
def dt_predict(x_train_g, x_test_g, y_train_g, l):
    dtc = DecisionTreeClassifier(max_depth=l)
    dtc.fit(x_train_g, y_train_g)
    return dtc.predict(x_test_g)
def lr_predict(x_train_h, x_test_h, y_train_h):
    lrc = LinearRegression()
    lrc.fit(x_train_h, y_train_h)
    return lrc.predict(x_test_h)
def print_error(y_t, y_p):
    mse = mean_squared_error(y_t, y_p)
    rmse = math.sqrt(mse)
    mae = mean_absolute_error(y_t, y_p)
    print('rmse', rmse)
    print('mae', mae)
def error(yt, yp):
    msee = mean_squared_error(yt, yp)
    rmsee = math.sqrt(msee)
    maee = mean_absolute_error(yt, yp)
    return rmsee, maee
y_test_predicted_knc = knc_predict(x_train, x_test, y_train,10)
y_test_predicted_dtc = dt_predict(x_train, x_test, y_train,100)
y_test_predicted_lrc = lr_predict(x_train, x_test, y_train)
print('knc')
print_error(y_test, y_test_predicted_knc)
print('dtc')
print_error(y_test, y_test_predicted_dtc)
print('lrc')
print_error(y_test, y_test_predicted_lrc)
For this example, we can see that linear regression has the least error.
q2. In this part we use grid search to find the best values for the hyperparameters.
def dt_hyper(x_train_a, x_test_a, y_train_a, y_test_a):
    # Define the RMSE metric for the scorer (negated, since GridSearchCV maximizes).
    def root_mean_squared_error(yt, yp):
        return math.sqrt(mean_squared_error(yt, yp))
    rmse_scorer = make_scorer(root_mean_squared_error, greater_is_better=False)
    pipe_tree = make_pipeline(tree.DecisionTreeRegressor(random_state=1))
    depths = np.arange(1, 30)
    param_grid = [{'decisiontreeregressor__max_depth': depths}]
    gs = GridSearchCV(estimator=pipe_tree, param_grid=param_grid, scoring=rmse_scorer)
    gs = gs.fit(x_train_a, y_train_a)
    print(-gs.best_score_)
    print(gs.best_params_)
    my_model = gs.best_estimator_
    my_model.fit(x_train_a, y_train_a)
    y_predicted_a = my_model.predict(x_test_a)
    mse = mean_squared_error(y_test_a, y_predicted_a)
    rmse = math.sqrt(mse)
    return gs.best_params_
bestdt = dt_hyper(x_train, x_test, y_train, y_test)
def knn_hyper(x_train_b, x_test_b, y_train_b, y_test_b):
    n_neighbors = list(range(1, 30))
    hyperparameters = dict(n_neighbors=n_neighbors)
    knn_2 = KNeighborsClassifier()
    clf = GridSearchCV(knn_2, hyperparameters, cv=10)
    best_model = clf.fit(x_train_b, y_train_b)
    print(best_model)
    print('Best n_neighbors:', best_model.best_estimator_.get_params()['n_neighbors'])
    my_model = clf.best_estimator_
    my_model.fit(x_train_b, y_train_b)
    y_predicted_b = my_model.predict(x_test_b)
    mse = mean_squared_error(y_test_b, y_predicted_b)
    rmse = math.sqrt(mse)
    return clf.best_estimator_
bestknn = knn_hyper(x_train, x_test, y_train, y_test)
def plot_hyper_dt(x_train_p, y_train_p, x_test_p, y_test_p):
    rmse_list_dt = []
    mae_list_dt = []
    for i in range(1, 30):
        dtc = DecisionTreeClassifier(max_depth=i)
        dtc.fit(x_train_p, y_train_p)
        y_test_predi = dtc.predict(x_test_p)
        rmsee_p, maee_p = error(y_test_p, y_test_predi)
        rmse_list_dt.append(rmsee_p)
        mae_list_dt.append(maee_p)
    x = range(1, 30)
    plt.plot(x, rmse_list_dt, 'r')
    plt.plot(x, mae_list_dt, 'b')
plot_hyper_dt(x_train,y_train,x_test,y_test)
def plot_hyper_knn(x_train_p, y_train_p, x_test_p, y_test_p):
    rmse_list_dt = []
    mae_list_dt = []
    # Scale once, outside the loop, so the data is not transformed repeatedly.
    scaler = StandardScaler()
    scaler.fit(x_train_p)
    x_train_p = scaler.transform(x_train_p)
    x_test_p = scaler.transform(x_test_p)
    for i in range(1, 30):
        knc = KNeighborsClassifier(n_neighbors=i)
        knc.fit(x_train_p, y_train_p)
        y_test_predi = knc.predict(x_test_p)
        rmsee_p, maee_p = error(y_test_p, y_test_predi)
        rmse_list_dt.append(rmsee_p)
        mae_list_dt.append(maee_p)
    x = range(1, 30)
    plt.plot(x, rmse_list_dt, 'r')
    plt.plot(x, mae_list_dt, 'b')
plot_hyper_knn(x_train,y_train,x_test,y_test)
We can see that the plots match the grid search results.
q3. To find out whether we have underfitting or overfitting, we need to use validation. To do so, we split our data into two parts, each time leaving one part out as validation. Based on the validation scores we decide whether our model fits well. For knn we have n=1 and for the decision tree we have depth=7.
x_train_1, x_test_1, y_train_1, y_test_1 =splitting(remove_cat(train_all_lost_remove),0.5,0)
x_test_2, x_train_2, y_test_2, y_train_2 =splitting(remove_cat(train_all_lost_remove),0.5,0)
y_test_predicted_knc = knc_predict(x_train_1, x_test_1, y_train_1, 1)
y_test_predicted_dtc = dt_predict(x_train_1, x_test_1, y_train_1, 7)
print('knc')
print_error(y_test_1, y_test_predicted_knc)
print('dtc')
print_error(y_test_1, y_test_predicted_dtc)
It can be seen that the errors are not considerably different, so we conclude that our models neither overfit nor underfit.
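A more systematic check is k-fold cross-validation: if a model scores much better on its training data than on the held-out folds, it is overfitting. A minimal sketch on synthetic data (a hypothetical stand-in for the house features, not the actual dataset):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the house features.
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=200)

# Compare training R^2 with the mean 5-fold cross-validated R^2.
for depth in (2, 30):
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    train_score = model.fit(X, y).score(X, y)
    cv_score = cross_val_score(model, X, y, cv=5).mean()
    print(depth, round(train_score, 3), round(cv_score, 3))
```

The deep tree fits its training data almost perfectly while its cross-validated score stays lower; that gap is the signature of overfitting.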
q4. For this part, we will analyze multiple ways of preprocessing and compare their results.
First, we work on the dataset in which all rows containing a missing value have been removed. It has not been normalized, all the categorical features have been removed, all the remaining features are included, and it is split 80/20 for training and testing without randomization.
# Redefine cor_show without the plot so repeated calls do not draw the heatmap.
def cor_show(df):
    cor = df.corr()
    # Correlation with output variable
    cor_target = abs(cor['SalePrice'])
    return cor_target
n=1
b=7
x_train, x_test, y_train, y_test = splitting(remove_cat(train_all_lost_remove),0.2, 0)
y_test_predicted_knc = knc_predict(x_train, x_test, y_train,n)
y_test_predicted_dtc = dt_predict(x_train, x_test, y_train,b)
y_test_predicted_lrc = lr_predict(x_train, x_test, y_train)
print('knc')
print_error(y_test, y_test_predicted_knc)
print('dtc')
print_error(y_test, y_test_predicted_dtc)
print('lrc')
print_error(y_test, y_test_predicted_lrc)
We try using different ratios for training and testing.
x_train, x_test, y_train, y_test = splitting(remove_cat(train_all_lost_remove),0.33, 0)
y_test_predicted_knc = knc_predict(x_train, x_test, y_train,n)
y_test_predicted_dtc = dt_predict(x_train, x_test, y_train,b)
y_test_predicted_lrc = lr_predict(x_train, x_test, y_train)
print('knc')
print_error(y_test, y_test_predicted_knc)
print('dtc')
print_error(y_test, y_test_predicted_dtc)
print('lrc')
print_error(y_test, y_test_predicted_lrc)
x_train, x_test, y_train, y_test = splitting(remove_cat(train_all_lost_remove),0.5, 0)
y_test_predicted_knc = knc_predict(x_train, x_test, y_train,n)
y_test_predicted_dtc = dt_predict(x_train, x_test, y_train,b)
y_test_predicted_lrc = lr_predict(x_train, x_test, y_train)
print('knc')
print_error(y_test, y_test_predicted_knc)
print('dtc')
print_error(y_test, y_test_predicted_dtc)
print('lrc')
print_error(y_test, y_test_predicted_lrc)
We can see that with more training data relative to testing data, we get better results. Now we repeat the same experiments, only this time we also apply normalization.
x_train, x_test, y_train, y_test = splitting(normalize(remove_cat(train_all_lost_remove)),0.2, 0)
y_test_predicted_knc = knc_predict(x_train, x_test, y_train,n)
y_test_predicted_dtc = dt_predict(x_train, x_test, y_train,b)
y_test_predicted_lrc = lr_predict(x_train, x_test, y_train)
print('knc')
print_error(y_test, y_test_predicted_knc)
print('dtc')
print_error(y_test, y_test_predicted_dtc)
print('lrc')
print_error(y_test, y_test_predicted_lrc)
x_train, x_test, y_train, y_test = splitting(normalize(remove_cat(train_all_lost_remove)),0.33, 0)
y_test_predicted_knc = knc_predict(x_train, x_test, y_train,n)
y_test_predicted_dtc = dt_predict(x_train, x_test, y_train,b)
y_test_predicted_lrc = lr_predict(x_train, x_test, y_train)
print('knc')
print_error(y_test, y_test_predicted_knc)
print('dtc')
print_error(y_test, y_test_predicted_dtc)
print('lrc')
print_error(y_test, y_test_predicted_lrc)
x_train, x_test, y_train, y_test = splitting(normalize(remove_cat(train_all_lost_remove)),0.5, 0)
y_test_predicted_knc = knc_predict(x_train, x_test, y_train,n)
y_test_predicted_dtc = dt_predict(x_train, x_test, y_train,b)
y_test_predicted_lrc = lr_predict(x_train, x_test, y_train)
print('knc')
print_error(y_test, y_test_predicted_knc)
print('dtc')
print_error(y_test, y_test_predicted_dtc)
print('lrc')
print_error(y_test, y_test_predicted_lrc)
x_train, x_test, y_train, y_test = splitting(normalize(remove_cat(train_all_lost_remove)),0.5, 42)
y_test_predicted_knc = knc_predict(x_train, x_test, y_train,n)
y_test_predicted_dtc = dt_predict(x_train, x_test, y_train,b)
y_test_predicted_lrc = lr_predict(x_train, x_test, y_train)
print('knc')
print_error(y_test, y_test_predicted_knc)
print('dtc')
print_error(y_test, y_test_predicted_dtc)
print('lrc')
print_error(y_test, y_test_predicted_lrc)
We can see that in this dataset randomization produces worse results.
Comparing with the non-normalized results, we can see that normalization does not noticeably affect the knn method. It worsens the results for the decision tree but improves them for linear regression. From now on we use 0.2 for the test split, since the results clearly show that more training data gives better results and better generalization. Now we try removing the features that are less correlated with the output.
x_train, x_test, y_train, y_test = splitting(remove_uncorrelated(remove_cat(train_all_lost_remove),0.5),0.2, 0)
y_test_predicted_knc = knc_predict(x_train, x_test, y_train,n)
y_test_predicted_dtc = dt_predict(x_train, x_test, y_train,b)
y_test_predicted_lrc = lr_predict(x_train, x_test, y_train)
print('knc')
print_error(y_test, y_test_predicted_knc)
print('dtc')
print_error(y_test, y_test_predicted_dtc)
print('lrc')
print_error(y_test, y_test_predicted_lrc)
We can see that by leaving out more features, knc becomes better but the other models still worsen.
x_train, x_test, y_train, y_test = splitting(remove_uncorrelated(remove_cat(train_all_lost_remove),0.2),0.2, 0)
y_test_predicted_knc = knc_predict(x_train, x_test, y_train,n)
y_test_predicted_dtc = dt_predict(x_train, x_test, y_train,b)
y_test_predicted_lrc = lr_predict(x_train, x_test, y_train)
print('knc')
print_error(y_test, y_test_predicted_knc)
print('dtc')
print_error(y_test, y_test_predicted_dtc)
print('lrc')
print_error(y_test, y_test_predicted_lrc)
We can see that knc stays almost the same but the other models become worse. This is because we are removing features that are important for the decision.
Now we use one-hot encoding instead of removing the categorical data.
x_train, x_test, y_train, y_test = splitting(one_hot_encoding(train_all_lost_remove),0.2, 0)
y_test_predicted_knc = knc_predict(x_train, x_test, y_train,n)
y_test_predicted_dtc = dt_predict(x_train, x_test, y_train,b)
y_test_predicted_lrc = lr_predict(x_train, x_test, y_train)
print('knc')
print_error(y_test, y_test_predicted_knc)
print('dtc')
print_error(y_test, y_test_predicted_dtc)
print('lrc')
print_error(y_test, y_test_predicted_lrc)
We can see that all the models worsen. The conclusion is that for this dataset, removing the categorical data works better.
x_train, x_test, y_train, y_test = splitting(normalize(one_hot_encoding(train_all_lost_remove)),0.2, 0)
y_test_predicted_knc = knc_predict(x_train, x_test, y_train,n)
y_test_predicted_dtc = dt_predict(x_train, x_test, y_train,b)
y_test_predicted_lrc = lr_predict(x_train, x_test, y_train)
print('knc')
print_error(y_test, y_test_predicted_knc)
print('dtc')
print_error(y_test, y_test_predicted_dtc)
print('lrc')
print_error(y_test, y_test_predicted_lrc)
We see that normalizing does not change knc, worsens lrc, but seems to slightly improve dtc.
x_train, x_test, y_train, y_test = splitting(remove_uncorrelated(one_hot_encoding(train_all_lost_remove),0.5),0.2, 0)
y_test_predicted_knc = knc_predict(x_train, x_test, y_train,n)
y_test_predicted_dtc = dt_predict(x_train, x_test, y_train,b)
y_test_predicted_lrc = lr_predict(x_train, x_test, y_train)
print('knc')
print_error(y_test, y_test_predicted_knc)
print('dtc')
print_error(y_test, y_test_predicted_dtc)
print('lrc')
print_error(y_test, y_test_predicted_lrc)
In this case, removing the less related features has improved the results vastly. Now we use the dataset in which rows were deleted only for the four features with the fewest missing values, while 'LotFrontage', the feature with the most missing values, was replaced with the mean.
x_train, x_test, y_train, y_test = splitting(remove_cat(train_preprocess_four_least_removed),0.2, 0)
y_test_predicted_knc = knc_predict(x_train, x_test, y_train,n)
y_test_predicted_dtc = dt_predict(x_train, x_test, y_train,b)
y_test_predicted_lrc = lr_predict(x_train, x_test, y_train)
print('knc')
print_error(y_test, y_test_predicted_knc)
print('dtc')
print_error(y_test, y_test_predicted_dtc)
print('lrc')
print_error(y_test, y_test_predicted_lrc)
x_train, x_test, y_train, y_test = splitting(normalize(remove_cat(train_preprocess_four_least_removed)),0.2, 0)
y_test_predicted_knc = knc_predict(x_train, x_test, y_train,n)
y_test_predicted_dtc = dt_predict(x_train, x_test, y_train,b)
y_test_predicted_lrc = lr_predict(x_train, x_test, y_train)
print('knc')
print_error(y_test, y_test_predicted_knc)
print('dtc')
print_error(y_test, y_test_predicted_dtc)
print('lrc')
print_error(y_test, y_test_predicted_lrc)
Normalization does not change the results.
x_train, x_test, y_train, y_test = splitting(remove_uncorrelated(remove_cat(train_preprocess_four_least_removed),0.5),0.2, 0)
y_test_predicted_knc = knc_predict(x_train, x_test, y_train,n)
y_test_predicted_dtc = dt_predict(x_train, x_test, y_train,b)
y_test_predicted_lrc = lr_predict(x_train, x_test, y_train)
print('knc')
print_error(y_test, y_test_predicted_knc)
print('dtc')
print_error(y_test, y_test_predicted_dtc)
print('lrc')
print_error(y_test, y_test_predicted_lrc)
Removing the uncorrelated features improves knc and lrc but worsens dtc.
x_train, x_test, y_train, y_test = splitting(one_hot_encoding(train_preprocess_four_least_removed),0.2, 0)
y_test_predicted_knc = knc_predict(x_train, x_test, y_train,n)
y_test_predicted_dtc = dt_predict(x_train, x_test, y_train,b)
y_test_predicted_lrc = lr_predict(x_train, x_test, y_train)
print('knc')
print_error(y_test, y_test_predicted_knc)
print('dtc')
print_error(y_test, y_test_predicted_dtc)
print('lrc')
print_error(y_test, y_test_predicted_lrc)
Encoding only worsens dtc and makes the others work better.
x_train, x_test, y_train, y_test = splitting(remove_uncorrelated(one_hot_encoding(train_preprocess_four_least_removed),0.5),0.2, 0)
y_test_predicted_knc = knc_predict(x_train, x_test, y_train,n)
y_test_predicted_dtc = dt_predict(x_train, x_test, y_train,b)
y_test_predicted_lrc = lr_predict(x_train, x_test, y_train)
print('knc')
print_error(y_test, y_test_predicted_knc)
print('dtc')
print_error(y_test, y_test_predicted_dtc)
print('lrc')
print_error(y_test, y_test_predicted_lrc)
Removing the less related data only worsens dtc and makes the others work better.
Now we work with the dataset in which all the missing values have been replaced and no rows have been removed.
x_train, x_test, y_train, y_test = splitting(remove_cat(train_all_replaced),0.2, 0)
y_test_predicted_knc = knc_predict(x_train, x_test, y_train,n)
y_test_predicted_dtc = dt_predict(x_train, x_test, y_train,b)
y_test_predicted_lrc = lr_predict(x_train, x_test, y_train)
print('knc')
print_error(y_test, y_test_predicted_knc)
print('dtc')
print_error(y_test, y_test_predicted_dtc)
print('lrc')
print_error(y_test, y_test_predicted_lrc)
With the same preprocessing as the others, this dataset performs like the one in which rows were removed. This is probably because our dataset is big.
x_train, x_test, y_train, y_test = splitting(normalize(remove_cat(train_all_replaced)),0.2, 0)
y_test_predicted_knc = knc_predict(x_train, x_test, y_train,n)
y_test_predicted_dtc = dt_predict(x_train, x_test, y_train,b)
y_test_predicted_lrc = lr_predict(x_train, x_test, y_train)
print('knc')
print_error(y_test, y_test_predicted_knc)
print('dtc')
print_error(y_test, y_test_predicted_dtc)
print('lrc')
print_error(y_test, y_test_predicted_lrc)
Normalizing does not change the performance.
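For the decision tree this is not just an empirical accident: min-max scaling is a monotonic per-feature transform, so a tree finds the same splits either way (for knn and linear regression the observation is empirical). A small self-contained check, using synthetic data rather than the house-price set:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the house-price features.
rng = np.random.RandomState(0)
X = rng.uniform(0, 1000, size=(200, 4))
y = X[:, 0] * 3 + X[:, 1] * 0.5 + rng.normal(0, 10, size=200)

# Fit the same tree on raw and on min-max scaled features.
raw_tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)
X_scaled = MinMaxScaler().fit_transform(X)
scaled_tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_scaled, y)

# Scaling preserves the ordering of every feature, so the tree chooses
# the same splits and the predictions coincide.
print(np.allclose(raw_tree.predict(X), scaled_tree.predict(X_scaled)))
```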
x_train, x_test, y_train, y_test = splitting(one_hot_encoding(train_all_replaced),0.2, 0)
y_test_predicted_knc = knc_predict(x_train, x_test, y_train,n)
y_test_predicted_dtc = dt_predict(x_train, x_test, y_train,b)
y_test_predicted_lrc = lr_predict(x_train, x_test, y_train)
print('knc')
print_error(y_test, y_test_predicted_knc)
print('dtc')
print_error(y_test, y_test_predicted_dtc)
print('lrc')
print_error(y_test, y_test_predicted_lrc)
One-hot encoding worsens the results for lrc.
x_train, x_test, y_train, y_test = splitting(remove_uncorrelated(one_hot_encoding(train_all_replaced),0.5),0.2, 0)
y_test_predicted_knc = knc_predict(x_train, x_test, y_train,n)
y_test_predicted_dtc = dt_predict(x_train, x_test, y_train,b)
y_test_predicted_lrc = lr_predict(x_train, x_test, y_train)
print('knc')
print_error(y_test, y_test_predicted_knc)
print('dtc')
print_error(y_test, y_test_predicted_dtc)
print('lrc')
print_error(y_test, y_test_predicted_lrc)
x_train, x_test, y_train, y_test = splitting(remove_uncorrelated(remove_cat(train_all_lost_remove),0.5),0.2, 0)
y_test_predicted_knc = knc_predict(x_train, x_test, y_train,n)
y_test_predicted_dtc = dt_predict(x_train, x_test, y_train,b)
y_test_predicted_lrc = lr_predict(x_train, x_test, y_train)
print('knc')
print_error(y_test, y_test_predicted_knc)
print('dtc')
print_error(y_test, y_test_predicted_dtc)
print('lrc')
print_error(y_test, y_test_predicted_lrc)
Removing categorical data seems to make the models work better.
In general, normalization does not affect the results. More training data produces better results, but choosing the split randomly can worsen the model. Removing categorical data improves knc but worsens the others. Encoding only makes dtc better. Removing less correlated features, as long as we do not overdo it, improves the results.
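The evaluate-three-models block repeated above could be wrapped in a small helper. The sketch below runs on synthetic data; `splitting` and the `*_predict` helpers are the notebook's own functions, replaced here by direct sklearn calls, and the hyperparameters are illustrative stand-ins:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

def evaluate_models(X, y, test_size=0.2, seed=0):
    """Fit knn, decision tree and linear regression on one split;
    return {name: (rmse, mae)} for each model."""
    x_train, x_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    models = {
        'knc': KNeighborsRegressor(n_neighbors=5),
        'dtc': DecisionTreeRegressor(max_depth=7, random_state=seed),
        'lrc': LinearRegression(),
    }
    results = {}
    for name, model in models.items():
        pred = model.fit(x_train, y_train).predict(x_test)
        rmse = np.sqrt(mean_squared_error(y_test, pred))
        results[name] = (rmse, mean_absolute_error(y_test, pred))
    return results

# Stand-in data; in the notebook this would be a preprocessed dataframe.
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
for name, (rmse, mae) in evaluate_models(X, y).items():
    print(name, round(rmse, 2), round(mae, 2))
```

With such a helper, each preprocessing variant becomes a single call instead of a nine-line block.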
Phase 3.
q1.In this part, we use random forest to generate a better model.
max_depth_rmse = []
max_depth_mae = []
for i in range(1, 30):
    x_train, x_test, y_train, y_test = splitting(remove_uncorrelated(remove_cat(train_all_lost_remove), 0.5), 0.2, 0)
    clf = RandomForestClassifier(max_depth=i, random_state=0)
    clf.fit(x_train, y_train)
    y_test_predicted_rf = clf.predict(x_test)
    rmse, mae = error(y_test, y_test_predicted_rf)
    max_depth_rmse.append(rmse)
    max_depth_mae.append(mae)
random_state_rmse = []
random_state_mae = []
for k in range(3):
    x_train, x_test, y_train, y_test = splitting(remove_uncorrelated(remove_cat(train_all_lost_remove), 0.5), 0.2, 0)
    # Use the chosen max_depth=5 (not the leftover loop variable) and
    # random_state values 0, 10, 20 to match the plot's x-axis.
    clf = RandomForestClassifier(max_depth=5, random_state=10 * k)
    clf.fit(x_train, y_train)
    y_test_predicted_rf = clf.predict(x_test)
    rmse, mae = error(y_test, y_test_predicted_rf)
    random_state_rmse.append(rmse)
    random_state_mae.append(mae)
plt.plot(range(1,30),max_depth_rmse ,label='max_depth vs. rmse')
plt.plot(range(1,30),max_depth_mae,label='max_depth vs. mae')
plt.plot(range(0,30,10),random_state_rmse,label='random_state vs. rmse')
plt.plot(range(0,30,10),random_state_mae,label='random_state vs. mae')
plt.legend()
plt.show()
print(min(random_state_rmse))
print(min(random_state_mae))
We can see that after a while the error becomes stable for both hyperparameters. So for max_depth we choose 5, and for random_state we choose 0. Compared to the decision trees we produced before, random forest yields lower errors.
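The two manual tuning loops above can also be expressed with GridSearchCV, which is already imported in this notebook. This is a sketch on synthetic data: the dataset is a stand-in, a regressor is used since the target is a price, and the grid values are illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the preprocessed house prices.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Search max_depth and random_state jointly, mirroring the loops above.
# Scoring is negated RMSE so that higher is better, as GridSearchCV expects.
param_grid = {'max_depth': [3, 5, 7, 9], 'random_state': [0, 10, 20]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50),
    param_grid,
    scoring='neg_root_mean_squared_error',
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```

GridSearchCV additionally cross-validates each combination, so the choice is less sensitive to a single train/test split than the loops above.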
q2. In this part, we use voting regression to achieve a better model. In this method, we use a model that averages the predictions of the three models we previously built.
r1 = KNeighborsRegressor(n_neighbors=1)
r2 = DecisionTreeRegressor(max_depth=7)
r3 = LinearRegression()
er = VotingRegressor([('knn', r1), ('dt', r2), ('lr', r3)])
er.fit(x_train, y_train)
y_test_predicted = er.predict(x_test)
print_error(y_test,y_test_predicted)
We can observe that the results improve.
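The averaging can be verified directly: with no weights, VotingRegressor's prediction equals the arithmetic mean of its members' predictions. A self-contained sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

# Same three members as in the notebook's ensemble.
r1 = KNeighborsRegressor(n_neighbors=1)
r2 = DecisionTreeRegressor(max_depth=7, random_state=0)
r3 = LinearRegression()
er = VotingRegressor([('knn', r1), ('dt', r2), ('lr', r3)]).fit(X, y)

# The ensemble prediction is the plain average of the fitted members.
manual_mean = np.mean([est.predict(X) for est in er.estimators_], axis=0)
print(np.allclose(er.predict(X), manual_mean))
```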
q3. In our case, voting regression improves the results. This is because voting collects the predictions of multiple models and averages them. Different models may work better on some data points than on others, and averaging reduces the error introduced by any single model's weaknesses. By combining different models, we are more likely to get good results regardless of which single model would have fit best. We can support this point by comparing the results with another model; here we use Gaussian Naive Bayes.
g = GaussianNB()
g.fit(x_train, y_train)
y_test_predicted = g.predict(x_test)
print_error(y_test,y_test_predicted)
As we can see, voting regression works better than this model too. By averaging the predictions of multiple models, the ensemble combines their strengths and produces a better prediction for each sample.